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Abstract 

Markov Chain Monte Carlo (MCMC) algorithms 
are often used for approximate inference inside 
learning, but their slow mixing can be difficult to 
diagnose and the approximations can seriously 
degrade learning. To alleviate these issues, we 
define a new model family using strong Doe- 
blin Markov chains, whose mixing times can 
be precisely controlled by a parameter. We 
also develop an algorithm to learn such models, 
which involves maximizing the data likelihood 
under the induced stationary distribution of these 
chains. We show empirical improvements on two 
challenging inference tasks. 

1. Introduction 

Conventional wisdom suggests that rich features and 
highly-dependent variables necessitate intractable infer¬ 
ence. Indeed, the dominant paradigm is to first define 
a joint model, and then use approximate inference (e.g., 
MCMC) to learn that model. While this recipe can generate 
good results in practice, it has two notable drawbacks: (i) 
diagnosing convergence of Markov chains is extremely dif¬ 
ficult (Gelman and Rubin, 1992; Cowles and Carlin, 1996); 
and (ii) approximate inference can be highly suboptimal 
in the context of learning (Wainwright, 2006; Kulesza and 
Pereira, 2007). 

In this paper, we instead use MCMC to define the model 
family itself: For a given T, we construct a family of 
Markov chains using arbitrary rich features, but whose 
mixing time is guaranteed to be at most 0{T). The corre¬ 
sponding stationary distributions make up the model fam¬ 
ily. We can think of our Markov chains as parameteriz¬ 
ing a family of “computationally accessible” distributions, 
where the amount of computation is controlled by T. 

For concreteness, suppose we are performing a structured 
prediction task from input x to a complex output y. We 
construct Markov chains of the following form, called 


strong Doeblin chains (Doeblin, 1940): 

A 0 {yt I yt-i,x) = (1 - e)Ae{yt \ yt-i,x) + euB{yt \ x), 

( 1 ) 

where e is a mixture coefficient and 6 parameterizes Ag and 
ug. Importantly, ug does not depend on the previous state 
yt-i- For intuition, think of ug as a simple tractable model 
and Ag as Gibbs sampling in a complex intractable model. 
With probability 1 — e, we progress according to Ag, and 
with probability e we draw a fresh sample from ug, which 
performs an informed random restart. When e = 1, we 
are drawing i.i.d. samples from ug', we therefore mix in a 
single step, but our stationary distribution must necessarily 
be very simple. When e = 0, the stationary distribution can 
be much richer, but we have no guarantees on the mixing 
time. For intermediate values of e, we trade off between 
representational power and mixing time. 

A classic result is that a given strong Doeblin chain mixes 
in time at most ^ (Doeblin, 1940), and that we can draw an 
exact sample from the stationary distribution in expected 
time 0(i) (Corcoran and Tweedie, 1998). In this work, we 
prove new results that help us understand the strong Doe¬ 
blin model families. Let if and if be the family of station¬ 
ary distributions corresponding to Ag and Ag as defined in 
(1), respectively. Our first result is that as e decreases, the 
stationary distribution of any Ag monotonically approaches 
the stationary distribution of the corresponding Ag (as mea¬ 
sured by either direction of the KL divergence). Our sec¬ 
ond result is that if ^ is much larger than the mixing time 
of Ag, then the stationary distributions of Ag and Ag are 
close under a certain Mahalanobis distance. This shows 
that any member of T that is computationally accessible 
via the Markov chain is well-approximated by its counter¬ 
part in F. 



The figure above shows F and F, together with the subset 
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J'o of T whose Markov chains mix quickly. T (approxi¬ 
mately) covers Tq, and contains some distributions outside 
of T entirely. 

In order to learn over T, we show how to maximize the 
likelihood of the data under the stationary distribution of 
Aff. Specifically, we show that we can compute a stochastic 
gradient of the log-likelihood in expected time O(-). Thus, 
in a strong sense, our objective function explicitly accounts 
for computational constraints. 

We also generalize strong Doeblin chains, which are a mix¬ 
ture of two base chains, ug and Ag, to staged strong Doe¬ 
blin chains, which allow us to combine more than two base 
chains. We introduce an auxiliary variable z representing 
the “stage” that the chain is in. We then transition between 
stages, using the base chain corresponding to the current 
stage z to advance the concrete state y. A common ap¬ 
plication of this generalization is defining a sequence of 
increasingly more complex chains, similar in spirit to an¬ 
nealing. This allows sampling to become gradually more 
sensitive to the structure of the problem. 

We evaluated our methods on two tasks: (i) inferring words 
from finger gestures on a touch screen and (ii) inferring 
DNF formulas for program verification. Unlike many 
structured prediction problems where local potentials pro¬ 
vide a large fraction of the signal, in the two tasks above, 
local potentials offer a very weak signal; reasoning care¬ 
fully about the higher-order potentials is necessary to per¬ 
form well. On word inference, we showed that learning 
strong Doeblin chains obtained a 3.6% absolute improve¬ 
ment in character accuracy over Gibbs sampling while re¬ 
quiring 5x fewer samples. On DNF formula inference, our 
staged strong Doeblin chain obtains an order of magnitude 
speed improvement over plain Metropolis-Hastings. 

To summarize, the contributions of this paper are: We for¬ 
mally define a family of MCMC algorithms based on strong 
Doeblin chains with guaranteed fast mixing times (Sec¬ 
tion 2). We provide an extensive analysis of the theoretical 
properties of this family (Section 3), together with a gener¬ 
alization to a staged version (Section 3.1). We provide an 
algorithm for learning the parameters of a strong Doeblin 
chain (Section 4). We demonstrate superior experimental 
results relative to baseline MCMC samplers on two tasks, 
word inference and DNF formula synthesis (Section 5). 

2. A Fast-Mixing Family of Markov Chains 

Given a Markov chain with transition matrix A{yt \ yt-i) 
and a distribution uijjt), define a new Markov chain with 

transitions given by A(yt | yt-i) (l-e)A(yt | yt-i) + 
^u{yt). (We suppress the dependence on 0 and x for now.) 


Stationary Probability vs. Restart Probability 




Figure 1. Left: A simple 3-state Markov chain. Arcs denote transi¬ 
tion probabilities. Right: plot of the stationary probability of state 
1 as a function of restart probability e, for S = 10“^. Note the two 
regimes for e ^ <5 and e S> 5. 

In matrix notation, we can write A as 

i'^= (l-e)A-beMl^. (2) 

In other words, with probability e we restart from m; oth¬ 
erwise, we transition according to A. Intuitively, A should 
mix quickly because a restart from u renders the past inde¬ 
pendent of the future (we formalize this in Section 3). We 
think of M as a simple tractable model that provides cover¬ 
age, and A as a complex model that provides precision. 

Simple example. To gain some intuition, we work 
through a simple example with the Markov chain A de¬ 
picted in Figure 1 . The stationary distribution of this chain 
[ 2 ^^ 24^5 2 tj^ ] ’ splitting most of the probabil¬ 

ity mass evenly between states 1 and 3. The mixing time 
of this chain is approximately j, since once the chain falls 
into either state 1 or state 3, it will take about y steps for it 
to escape back out. If we run this Markov chain for T steps 
with T ^ y, then our samples will be either almost all in 
state 1 or almost all in state 3, and thus will provide a poor 
summary of the distribution. If instead we perform ran¬ 
dom restarts with probability e from a uniform distribution 
u over {1, 2, 3}, then the restarts give us the opportunity 
to explore both modes of the distribution. After a restart, 
however, the chain will more likely fall into state 3 than 
state 1 (| probability vs. |), so for e ^ 6 the stationary 
distribution will be noticeably perturbed by the restarts. If 
e 6, then there will be enough time for the chain to mix 
between restarts, so this bias will vanish. See Figure 1 for 
an illustration of this phenomenon. 

3. Theoretical Properties 

Markov chains that can be expressed according to (2) are 
said to have strong Doeblin parameter e (Doeblin, 1940). 
In this section, we characterize the stationary distribution 
and mixing time of A, and also relate the stationary distri¬ 
bution of A to that of A as a function of e. Often the easiest 
way to study the mixing time of A is via its spectral gap. 
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which is defined as 1 — A 2 (A), where A 2 (A) is the second- 
largest eigenvalue (in complex norm). A standard result for 
Markov chains is that, under mild assumptions, the mixing 
time of A is O 


(i-A2(A))- 


We assume throughout this sec¬ 
tion that A is ergodic but not necessarily that it is reversible. 
See Section 12.4 of Levin et al. (2009) for more details. 


Our first result relates the spectral gap (and hence the mix¬ 
ing time) to e. This result (as well as the next) are well- 
known but we include them for completeness. For most 
results in this section, we sketch the proof here and provide 
the full details in the appendix. 

Proposition 3.1. The spectral gap of A is at least e; that 
is, 1 — A 2 (A) > e. In particular, A mixes in time O(^). 


The key idea is that all eigenvectors of A and A (except 
for the stationary distribution) are equal, and that Afe(A) = 
(1 — e)Afc(A) for fc > 1. 

Having established that A mixes quickly, the next step is to 
determine its stationary distribution; 

Proposition 3.2. Let tt be the stationary distribution of A. 
Then 

OO 

TT = e ^^(1 — ey A^u = e{I — (1 — e)A)~^u. (3) 

j=o 


This can be directly verified algebraically. The summa¬ 
tion over j shows that we can in fact draw an exact sample 
from TT by drawing T ~ Geometric(e),' initializing from 
u, and transitioning T times according to A. This is intu¬ 
itive, since at a generic point in time we expect the most 
recent sample from u to have occurred Geometric(e) steps 
ago. Note that E[T -f 1] = which is consistent with the 
fact that the mixing time is O(^) (Proposition 3.1). 

We would like to relate the stationary distributions tt and tt 
of A and A. The next two results (which are new) do so. 

Let TTe denote the stationary distribution of A at a particular 
value of e; note that tti = u and tto = tt. We will show that 
fi’e approaches tt monotonically, for both directions of the 
KL divergence. In particular, for any e < 1, tTj is at least as 
good as u at approximating tt. 

To show this, we make use of the following lemma from 
Murray and Salakhutdinov (2008): 

Lemma 3.3. If B is a transition matrix with stationary 
distribution tt, then KL (tt || Btt') < KL (tt || tt') and 
KL (Btt' || tt) < KL (tt' || tt). 

Using this lemma, we can prove the following monotonic¬ 
ity result: 

*If T ~ Geometric(e), we have P[r = j] = e(l — e)^ for 
j>0. 



Figure 2. Markov chains over (a) two stages (strong Doeblin 
chains); and (b) three stages (restart from u followed by transi¬ 
tions from Al and then from A 2 ). 

Proposition 3.4. Both KL (tt,; || tt) and KL (tt || Irf) are 
monotonic functions of e. 

The idea is to construct a transition matrix B that maps 
TTej to Tr,:^ for given £2 < ei. then show that its stationary 
distribution is tt and apply Lemma 3.3. 

With Proposition 3.4 in hand, a natural next question is how 
small e must be before tt is reasonably close to tt. Propo¬ 
sition 3.5 provides one such bound: tt is close to tt if e is 
small compared to the spectral gap 1 — A 2 (A). 

Proposition 3.5. Suppose that A satisfies detailed balance 
with respect to tt. Let tt be the stationary distribution of A. 

Define d^(TT') ='" ||7r-7r'||diag(,i)-i = -I + Ey 

where ||w||m the Mahalonobis distance s/ Mv. Then 
d-w(^) < i_a 2 (a) ■ '^7r(u)- (In particular, ^ 1 

e < (1 - A2(A))/d,r(u).) 

The proof is somewhat involved, but the key step is to es¬ 
tablish that dTr(Tr') is convex in tt' and contractive with re¬ 
spect to A (more precisely, that ((.^.(ATr') < A2(A)(i,r(7r')). 

Proposition 3.5 says that if A mixes quickly, then A and A 
will have similar stationary distributions. This serves as a 
sanity check: if A already mixes quickly, then if is a good 
approximation to tt; otherwise, the Doeblin construction 
ensures that we are at least converging to some distribution, 
which by Proposition 3.4 approximates tt at least as well as 
u does. 

3.1. Staged strong Doeblin chains 

Recall that to run a strong Doeblin chain A, we first sam¬ 
ple from u, and then transition according to A for approx¬ 
imately - steps. The intuition is that sampling from the 
crude distribution u faciliates global exploration of the state 
space, while the refined transition A hones in on a mode. 
However, for complex problems, there might be a consider¬ 
able gap between what is possible with exact inference (it) 
and what is needed for accurate modeling (A). This moti¬ 
vates using multiple stages of MCMC to bridge the gap. 

To do this, we introduce an auxiliary variable z G Z denot¬ 
ing which stage of MCMC we are currently in. For each 
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stage z, we have a Markov chain Az{yt \ yt-i) over the 
original state space. We also define a Markov chain C{zt \ 
Zt-i) over the stages. To transition from (yt_i,zt_i) to 
{yt,Zt), we first sample Zt from C{zt \ Zt-i) and then 
yt from Az^{yt \ yt-i)- If there is a special state z* for 
which Az‘~{yt \ yt-i) = u{yt) (i-C-, does not depend 
on yt-i), then we call the resulting chain a staged strong 
Doeblin chain. 

For example, if z € {Oj 1} ™d we transition from 0 to 1 
with probability 1 — e and from 1 to 0 with probability e, 
then we recover strong Doeblin chains assuming z* = 0 
(Figure 2(a)). As another example (Figure 2(b)), let z € 
{0,1,2}. When z = z* = 0, transition according to a 
restart distribution when z = 1, transition according 
to a simple chain Ai; and when z = 2 transition according 
to a more complex chain A 2 . If we transition from z = 0 
to z = 1 with probability 1, from z = 1 to z = 2 with 
probability ei, and from z = 2 to z = 0 with probability £ 2 , 
then we will on average draw 1 sample from u, A samples 
from Ai, and A samples from A 2 . 

We now show that staged strong Doeblin chains mix 
quickly as long as we visit z* reasonably often. In par¬ 
ticular, the following theorem provides guarantees on the 
mixing time that depend only on z* and on the structure of 
C{zt I Zt-i), analogous to the previous dependence only 
on e. The condition of the theorem asks for times a and b 
such that the first time after a that we hit z* is almost inde¬ 
pendent of the starting state zq, and is less than b with high 
probability. 

Theorem 3.6. Let M be a staged strong Doeblin chain on 
Z y. y. Let Ta be the earliest time s > a for which Zs = 
z*. Let /3a,s = minzez P[Ta = s \ zq = z] and ja,b ’= 
X]s=a (^a,s- Then has strong Doeblin parameter ^a,b- 
In particular, the spectral gap ofM is at least 2^. (Setting 
a = b = 1 recovers Proposition 3.1.) 

The key idea is that, conditioned on Ta, {yb, Zb) is indepen¬ 
dent of (yo, zo) for all b > Ta. For the special case that the 
stages form a cycle as in Figure 2, we have the following 
corollary; 

Corollary 3.7. Let C be a transition matrix on {0, ... ,k — 
1} such that C{zt = i \ Zt-i = i) = \ — 5i and C{zt = 
(i -f 1) mod k I Zt-i = i) = Si. Suppose that S^-i < 
max( 2 ,fc-i) , 4-2}- Then the spectral gap of 

the joint Markov chain is at least 

The key idea is that, when restricting to the time interval 
[2/4-1) 3/4-i]) the time of first transition from fc — 1 to 
0 is approximately Geometric(4-i)-distributed (indepen¬ 
dent of the initial state), which allows us to invoke Theo¬ 
rem 3.6. We expect the optimal constant to be much smaller 
than 78. 


4. Learning strong Doeblin chains 

Section 3 covered properties of strong Doeblin chains (1 — 
e)As -\- eugl^ for a fixed parameter vector 9. Now we turn 
to the problem of learning 9 from data. We will focus on the 
discriminative learning setting where we are given a dataset 
and want to maximize the conditional log- 

likelihood; 

1 " 

^>(0) = - V logpe{y^"'> I (4) 

n z —' 

i-l 

where now pg is the stationary distribution of Ag = (1 — 
e)Ag -f eugl^. We assume for simplicity that the chains 
Ag and ug are given by conditional exponential families; 

z^e{y I y',x) exp (9^(j){x,y',y) - log Z{9;x,y)) , 

ue{y I x) exp {9^(j){x, y) - log Z{9-, x)) , (5) 

where each f outputs a feature vector and the Z are parti¬ 
tion functions. By Proposition 3.1, Ag mixes quickly for all 
9. On the other hand, the parameterization of Ag captures a 
rich family of transition kernels, including Gibbs sampling. 

At a high level, our learning algorithm performs stochas¬ 
tic gradient descent on the negative log-likelihood. How¬ 
ever, the negative log-likelihood is only defined implicitly 
in terms of the stationary distribution of a Markov chain, 
so the main challenge is to show that it can be computed 
efficiently. To start, we assume that we can operate on the 
base chains ug and Ag for one step efficiently; 
Assumption 4.1. We can efficiently sample y from ug[- \ 
x) and Agf \ y',x), as well as compute 

d\ozAg(y\y’ ,x) 

ae 

Under Asssumption 4.1, we will show how to efficiently 
compute the gradient of | xA>) with respect to 

9. The impatient reader may skip ahead to the final pseu¬ 
docode, which is given in Algorithm 1 . 

For convenience, we will suppress the dependence on x and 
i and just refer to pe( 2 /) instead of | Comput¬ 

ing the gradient of \ogpg{y) is non-trivial, since the for¬ 
mula for pg is somewhat involved; 

OO 

Pe{y) = ty[Alug]{y). (6) 

We are helped by the following generic identity on gra¬ 
dients of conditional log-probabilities, proved in the ap¬ 
pendix. 

Lemma 4.2. Let z have distribution pg{z) parameterized 
by a vector 9. Let S be any measurable set. Then 

dlogpg(z € S) ^ \ dlogpg{z) 
d9 ^ 39 


G S 


(7) 
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We can utilize Lemma 4.2 by interpreting y | 0 as the out¬ 
put of the following generative process, which by Proposi¬ 
tion 3.2 yields the stationary distribution of Ag: 

• Sample yg from ug and j/t+i | yt from Ag for t = 

0,1,.... 

• Sample T ~ Geometric(e) and let y = yx- 

We then invoke Lemma 4.2 with z = (T, yo-.r) and S en¬ 
coding the event that yx = y- As long as we can sample 
from the posterior distribution of (T, yo-.x) conditioned on 
yx = y, we can compute an estimate of log pg{y) as 
follows: 

• Sample {T,yo-.x) \yT = y- 

m Return 9 log Pt) (LyO:T) _ dloguejyo) 

• JXCLUin gg — gg 

I 9logAg(yt\vt-i} 

l^t=l de 

4.1. Sampling schemes for (T, yo-.r) 

By the preceding comments, it suffices to construct a sam¬ 
pler for {T, yo-.x) \ yr = y- A. natural approach is to use 
importance sampling: sample {T,yo-.x-i), then weight by 
Piyr = y I yr-i)- However, this throws away a lot of 
work — we make T MCMC transitions but obtain only one 
sample (T, yo-.x) with which to estimate the gradient. 

We would like to ideally make use of all the MCMC transi¬ 
tions when constructing our estimate of (T, yo-.r) \ yr = y- 
For any t < T, the pair (f,yo:t) would itself have been 
a valid sample under different randomness, and we would 
like to exploit this. Suppose that we sample T from some 
distribution F and let q{t) be the probability that T > t 
under F. Then we can use the following scheme: 

• Sample T from F, then sample yo:T-i- 

• For f = 0, ...,T, weight {t,yo-.t-i) by x 

p{yt = y I yt-i)- 

For any q, this yields unbiased (although unnormalized) 
weights (see Section B in the appendix). Typically we will 
choose q{t) = (1 — e)*, e.g. F is a geometric distribu¬ 
tion. If the yt are perfectly correlated, this will not be 
any more effective than vanilla importance sampling, but 
in practice this method should perform substantially bet¬ 
ter. Even though we obtain weights on all of yo-.r, these 
weights will typically be highly correlated, so we should 
still repeat the sampling procedure multiple times to min¬ 
imize the bias from estimating the normalization constant. 
The full procedure is given as pseudocode in Algorithm 1. 

4.2. Implementation 

With the theory above in place, we now describe some im¬ 
portant implementation details of our learning algorithm. 


Algorithm 1 Algorithm for computing an estimate of 
^ logpe{y I x). This estimate is unbiased in the limit 
of infinitely many samples k, but will be biased for a finite 
number of samples due to variance in the estimate of the 
partition function. 

SampleGradient(a:, y, 9, e, k) 

\> k is the number of samples to take 
Z ^ 0; g ^ 0 \> Z is the total importance mass of all 

samples, ^ is the gradient 
for i = 1 to fc do 

Sample T ~ Geometric(e) 

Sample yo from ug{- \ x) 

For 1 < f < T—1: sample yt fmm.Ag{- \ yt-i,x) 
wo^ e- ug{y) 

For 1 < f < T: wt ^ e • Ag{y \ yt-i,x) 

z ^ z + Yl^=o'^t 

I „„ ( aiogti8(yo|a:) 

+ Z^t=i -gr“ ' ” 

d^ogAe(ys\ys-i,x) . dlog Ag{y\yt-i,x) \ 
■'■Z^s=l d9 de )■ 

end for 

Output ^ 


At a high level, we can just use Algorithm 1 to compute 
estimates of the gradient and then apply an online learning 
algorithm such as AdaGrad (Duchi et al., 2011) to iden¬ 
tify a good choice of 9. Since the log-likelihood is a non- 
convex function of 9, the initialization is important. We 
make the following (weak) assumption: 

Assumption 4.3. The chains ug and Ag are controlled by 
disjoint coordinates of 9, and for any setting ofug there is 
a corresponding choice of Ag that leaves ug invariant (i.e., 
AgUg = Ug). 

In practice. Assumption 4.3 is easy to satisfy. For in¬ 
stance, suppose that ^ is a feature func¬ 
tion, 9 = [9o 9 1 ] e jjjg features controlling 

u and A, and ug^ is made tractable by zeroing some fea¬ 
tures out: ueg{y) oc exp([6>o Og-doVHy))- Also sup¬ 
pose that Ag^^ is a Gibbs sampler that uses all the features: 
Agiiy I y') oc exp{9j(j){yi,yA)), where i is a randomly 
chosen coordinate of y. Then, we can satisfy Assump¬ 
tion 4.3 by setting 9i = [9o Od-do]- 

Under Assumption 4.3, we can initialize 9 by first training 
u in isolation (which is a convex problem if ug parameter¬ 
izes an exponential family), then initializing A to leave u 
invariant; this guarantees that the initial log-likelihood is 
what we would have obtained by just using u by itself. We 
found this to work well empirically. 

As another note. Algorithm 1 naively looks like it takes 
O(r^) time to compute the gradient for each sample, due 
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x: bdsadbnnnfaassjjj 
z: b # # a # # n-n-n # a-a # # n-n a 

y: b a n a n a 


Figure 3. Generated sequence of keyboard gestures for the word 
banana. The input 1 is a sequence of characters (the recorded key 
presses), and the output y is the intended word. Most characters 
in X are incidental and do not correspond to any character in j/; 
this is reflected by the (unobserved) alignment z. 

to the nested sum. However, most terms are of the form 
9 ; by grouping them for a hxed s we 

can compute the sum in 0{T) time, leading to expected 
runtime O (*) for Algorithm 1 (since E[T + 1] = 7 ). 

5. Experiments 

We validated our method on two challenging inference 
tasks. These tasks are difficult due to the importance of 
high-arity factors; local information is insufficient to even 
identify high-probability regions of the space. 

Inferring Words from Keyboard Gestures We hrst 
considered the task of inferring words from keyboard ges¬ 
tures. We generated the data by sampling words from the 
New York Times corpus (Sandhaus, 2008). For each word, 
we used a time series model to synthetically generate hn- 
ger gestures for the word. A typical instantiation of this 
process is given in Figure 3. The learning task is to dis- 
criminatively infer the intended word y given the sequence 
of keys x that the finger was over (for instance, predicting 
banana from bdsadbnnnfaassjjj). In our model, 
we posit a latent alignment z between key presses and in¬ 
tended letter. Given an input x of length I, the alignment z 
also has length l\ each Zi is either ‘c’ (xi starts an output 
letter c), ‘-c’ (a;^ continues an output letter c), or ‘#’ (xi is 
unaligned); see Figure 3 for an example. Note that y is a 
deterministic function of z. 

The base model ug consists of indicator features on {xi, zi), 
{xi, Zi-i, Zi), and {xi,Xi-i, Zi). The full Ag is a Gibbs 
sampler in a model where we include the following features 
in addition to those above: 

• indicator features on (xi, yt, yi-i) 

• indicator of y being in the dictionary, as well as log of 
word frequency (conditioned on being in the dictionary) 

• for each i, indicator of yi-j matching a prehx of a word 
in the dictionary 


We compared three approaches: 

• Our approach (Doeblin sampling) 

• Regular Gibbs sampling, initialized by setting Zi = Xi 
for all i (basic-Gibbs) 

• Gibbs sampling initialized from ug (ug-Gibbs) 

At test time, all three of these methods are almost identi¬ 
cal: they all initialize from some distribution, then make 
a certain number of Gibbs samples. For basic-Gibbs and 
Mg-Gibbs, this is always a fixed number of steps T, while 
for Doeblin sampling the number of steps is a geometric 
distribution with mean T. 

The main difference is in how the methods are trained. Our 
method is trained using the ideas in Section 4; for the other 
two methods, we train by approximating the gradient: 

V\ogpe{y I x) =Ezr^pe{z\x,y)[(l){y^Z.^)] 

^y,Z'^po(y,z\x) [0(^5 ^)]; 

where (f){y, z, x) is the feature function and pg is the sta¬ 
tionary distribution of Ag. For the second term, we use 
MCMC samples from Ag to approximate pg{y,z \ x). 
For the hrst term, we could take the subset of samples 
where y = y, but this is problematic if no such samples 
exist. Instead, we reweight all samples with y 7 ^ y by 
exp(—(D-fl)), where D is the edit distance between y and 
y. We use the same reweighting approach for the Doeblin 
sampler, using this as the importance weight rather than us¬ 
ing Ag{y \yt-i) as in Algorithm 1. 

To provide a fair comparison of the methods, we set e in the 
Doeblin sampler to the inverse of the number of transitions 
T, so that the expected number of transitions of all algo¬ 
rithms is the same. We also devoted the first half of each 
chain to burn-in. 

All algorithms are trained with AdaGrad (Duchi et ah, 
2011 ) with 16 independent chains run for each example. 
We measure word-level accuracy by computing the frac¬ 
tion of (non-burn-in) samples whose output y is correct. 

The results are reported in Figure 4. Overall, our Doe¬ 
blin sampler outperforms ug-Gibbs by a significant mar¬ 
gin, which in turn outperforms basic-Gibbs. Interestingly, 
while the accuracy of our method continues to improve 
with more training time, rtg-Gibbs quickly asymptotes and 
then slightly decreases, even for training accuracy. 

What is happening to ug-Gibbs? Since the inference prob¬ 
lem in this task is hard, the samples provide a poor gradient 
approximation. As a result, optimization methods that take 
the approximation at face value may not converge to even 
a local optimum. This phenomenon has already been stud¬ 
ied in other contexts, for instance by Kulesza and Pereira 
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Figure 4. Plots of word-level (left) and character-level (right) accuracy. The first panel gives the performance of all 3 methods (Doeblin 
chains, smart restarts, and simple Gibbs) for a computational budget of 20 transitions per example. The second and third panels give the 
performance of Doeblin chains and smart restarts, respectively, for increasing computational budgets (20, 50, and 100 transitions). 


(2007) and Huang et al. (2012). 

In contrast, our method directly optimizes the log- 
likelihood of the data under the distribution TTg, so that ac¬ 
curacy continues to increase with more passes through the 
training data. This demonstrates that the MCMC samples 
do provide enough signal to train from, but that naively 
plugging them into a method designed for exact inference 
will fail to exploit that signal. 

Inferring DNF Formulas We next study the use of our 
staged Doeblin chain construction as a tool for hierarchical 
initialization. We ignore learning for now, instead treat¬ 
ing MCMC as a stochastic search algorithm. Our task 
of interest is to infer a DNF formula / from its input- 
output behavior. This is an important subroutine in loop 
invariant synthesis, where MCMC methods have recently 
shown great promise (Gulwani and Jojic, 2007; Sharma and 
Aiken, 2014). 

Concretely, the input x might look like this: 

/(I, 2, 3) = True 
/(I, 4, 4) = True 
/(0,1, 0) = False 
/(O, 2, 2) = True 

Our task is to reconstruct /; in this case, f{xi,X 2 ,X 3 ) = 
[a;i 0] V [x2 = xs]. 

More formally, we consider DNF formulae with linear in¬ 
equality predicates: f{x) = Vr=i < bij], 

where aij,x G 1,^ and bij G Z. The formula / maps input 
vectors to {True, False}. Given a collection of example 
inputs and outputs, our goal is to find an / consistent with 
all examples. Our evaluation metric is the time to find such 


a formula. 

The search space for this problem is extremely large. Even 
if we set n = m = 3 and restrict our search to Uij G 
{ —1, 0,1}^, b G (—1,0,1}, the total number of candidate 
formulae is still (3®)^^^ « 5.8 x 10^®. 

We consider three MCMC methods: no restarts (0-stage), 
uniformly random restarts (1-stage), and a staged method 
(2-stage) as in Section 3.1. All base chains perform 
Metropolis-Hastings using proposals that edit individual 
atoms (e.g., [aj^x < bij]), either by changing a single entry 
of [aij bij] or by changing all entries of [a^- bij] at once. 
For the staged method, we initialize uniformly at random, 
then take Geometric(0.04) transitions based on a simpli¬ 
fied cost function, then take Geometric(0.0002) steps with 
the full cost (this is the staged Doeblin chain in Figure 2). 

The full cost function is /(/), the number of examples / 
errs on. We stop the Markov chain when it finds a formula 
with /(/) = 0. The simplified cost function decomposes 
over the disjuncts: for each disjunct d{x), if f{x) — False 
while d{x) = True, we incur a large cost (since in order for 
f{x) to be false, all disjuncts comprising f{x) must also be 
false). If f{x) = True while d{x) — False, then we incur 
a smaller cost. If f{x) = d(x) then we incur no cost. 

We used all three methods as a subroutine in verifying 
properties of C programs; each such verification requires 
solving many instances of DNF formula inference. Using 
the staged method we are able to obtain a 30% speedup 
over uniformly random restarts and a 50x improvement 
over no restarts, as shown in Table 1 . 
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Table 1. Comparison of 3 different MCMC algorithms. 0-stage uses no restarts, 1-stage uses random restarts, and 2-stage uses random 
restarts followed by a short period of MH with a simplified cost function. The table gives mean time and standard error (in seconds) 
taken to verify 5 different C programs, for 1000 trials. Each verification requires inferring many DNF formulae as a sub-routine. 


Task 

figl 

cegar2 

nested 

tacas06 

hard 

0 -stage 

2.6 ± 1.0 

320 ±9.3 

120 ±7.0 

> 600 

> 600 

1 -stage 

0.074 ±0.001 

0.41 ±0.01 

2.4 ±0.10 

6.8 ±0.15 

52 ± 1.5 

2 -stage 

0.055 ±0.005 

0.33 ±0.007 

2.3 ±0.12 

4.6 ±0.12 

31 ±0.90 


6. Discussion 

We have proposed a model family based on strong Doe- 
blin Markov chains, which guarantee fast mixing. Our 
construction allows us to simultaneously leverage a sim¬ 
ple, tractable model {ug) that provides coverage together 
with a complex, accurate model {Ag) that provides preci¬ 
sion. As such, we sidestep a typical dilemma—whether to 
use a simple model with exact inference, or to deal with the 
consequences of approximate inference in a more complex 
model. 

While our approach works well in practice, there are still 
some outstanding issues. One is the non-convexity of the 
learning objective, which makes the procedure dependent 
on initialization. Another issue is that the gradients re¬ 
turned by Algorithm 1 can be large, heterogeneous, and 
high-variance. The adaptive nature of AdaGrad allevi¬ 
ates this somewhat, but it would still be ideal to have a sam¬ 
pling procedure that had lower variance than Algorithm 1. 

Though Gibbs sampling is the de facto method for many 
practitioners, there are also many more sophisticated ap¬ 
proaches to MCMC (Green, 1995; Earl and Deem, 2005). 
Since our framework is orthogonal to the particular choice 
of transition kernel, it would be interesting to apply our 
method in these contexts. 

Finally, we would like to further explore the staged con¬ 
struction from Section 3.1. As the initial results on DNF 
formula synthesis are promising, it would be interesting to 
apply the construction to high-dimensional feature spaces 
as well as rich, multi-level hierarchies. We believe this 
might be a promising approach for extremely rich models 
in which a single level of re-initialization is insufficient to 
capture the complexity of the cost landscape. 

Related work. Our learning algorithm is reminiscent of 
policy gradient algorithms in reinforcement learning (Sut¬ 
ton et al., 2000), as well as Seam, which tries to learn an 
optimal search policy for structured prediction (Daume 111 
et al., 2009); see also Shi et al. (2015), who apply rein¬ 
forcement learning in the context of MCMC. Our staged 
construction is also similar in spirit to path sampling (Gel- 
man and Meng, 1998), as it uses a multi-stage approach to 
smoothly transition from a very simple to a very complex 


distribution. 

Our staged Doeblin construction belongs to the family of 
coarse-to-fine inference methods, which operate on pro¬ 
gressively more complex models (Viola and Jones, 2004; 
Shen et al., 2004; Collins and Koo, 2005; Charniak et al., 
2006; Carreras et al., 2008; Gu et al., 2009; Weiss et al., 
2010; Sapp et al., 2010; Petrov, 2011; Yadollahpour et al., 
2013). 

On the theoretical front, we make use of the well-developed 
theory of strong Doeblin chains, often also referred to 
with the terms minorization or regeneration time (Doe¬ 
blin, 1940; Roberts and Tweedie, 1999; Meyn and Tweedie, 
1994; Athreya and Ney, 1978). The strong Doeblin prop¬ 
erty is typically used to study convergence of continuous- 
space Markov chains, but Rosenthal (1995) has used it 
to analyze Gibbs sampling, and several authors have pro¬ 
vided algorithms for sampling exactly from arbitrary strong 
Doeblin chains (Propp and Wilson, 1996; Corcoran and 
Tweedie, 1998; Murdoch and Green, 1998). We are the 
first to use strong Doeblin properties to construct model 
families and learn them from data. 


At a high level, our idea is to identify a family of 
models for which an approximate inference algorithm is 
known to work well, thereby constructing a computation¬ 
ally tractable model family that is nevertheless more ex¬ 
pressive than typical tractable families such as low-tree- 
width graphical models. We think this general program is 
very interesting, and could be applied to other inference 
algorithms as well, thus solidfying the link between statis¬ 
tical theory and practical reality. 

References 

KB Athreya and P Ney. A new approach to the limit theory 
of recurrent Markov chains. Transactions of the AMS, 
1978. 

Xavier Carreras, Michael Collins, and Terry Koo. Tag, 
dynamic programming, and the perceptron for efficient, 
feature-rich parsing. In CoNLL, 2008. 

E Charniak, M Johnson, M Eisner, J Austerweil, D Ellis, 












Learning Fast-Mixing Models for Structured Prediction 


I Haxton, C Hill, R Shrivaths, J Moore, M Pozar, et al. 
Multilevel coarse-to-fine PCFG parsing. In NAACL, 
2006. 

Michael Collins and Terry Koo. Discriminative reranking 
for natural language parsing. Computational Linguistics, 
31(l):25-70, 2005. 

JN Corcoran and RL Tweedie. Perfect sampling of Harris 
recurrent Markov chains, preprint, 1998. 

MK Cowles and BP Carlin. Markov chain Monte Carlo 
convergence diagnostics: a comparative review. Journal 
of the American Statistical Association, 1996. 

H Daume III, J Langford, and D Marcu. Search-based 
structured prediction. Machine learning, 2009. 

W Doeblin. Elements d’une theorie generale des chaines 
simples constantes de markoff. In Annales scientifiques 
de I’Ecole Normale Superieure, volume 57, pages hi¬ 
ll 1. Societe mathematique de France, 1940. 

J Duchi, E Hazan, and Y Singer. Adaptive subgradient 
methods for online learning and stochastic optimization. 
Journal of Machine Learning Research, pages 2121- 
2159, 2011. 

David J Earl and Michael W Deem. Parallel tempering: 
Theory, applications, and new perspectives. Physical 
Chemistry Chemical Physics, 7(23):3910-3916, 2005. 

A Gelman and XL Meng. Simulating normalizing con¬ 
stants: From importance sampling to bridge sampling to 
path sampling. Statistical science, 1998. 

A Gelman and DB Rubin. A single series from the Gibbs 
sampler provides a false sense of security. Bayesian 
statistics, 1992. 

PJ Green. Reversible jump Markov chain Monte 
Carlo computation and Bayesian model determination. 
Biometrika, 1995. 

Chunhui Gu, Joseph J Lim, Pablo Arbelaez, and Jitendra 
Malik. Recognition using regions. In CVPR, pages 
1030-1037. IEEE, 2009. 

S Gulwani and N Jojic. Program verification as probabilis¬ 
tic inference. In ACM SIGPLANNotices, 2007. 

Liang Huang, Suphan Fayong, and Yang Guo. Struc¬ 
tured perceptron with inexact search. In Proceedings of 
the 2012 Conference of the North American Chapter of 
the Association for Computational Linguistics: Human 
Language Technologies, pages 142-151. Association for 
Computational Linguistics, 2012. 


Alex Kulesza and Fernando Pereira. Structured learning 
with approximate inference. In Advances in neural in¬ 
formation processing systems, pages 785-792, 2007. 

DA Levin, Y Peres, and EL Wilmer. Markov chains and 
mixing times. 2009. 

SP Meyn and RL Tweedie. Computable bounds for geo¬ 
metric convergence rates of Markov chains. The Annals 
of Applied Probability, 1994. 

DJ Murdoch and PJ Green. Exact sampling from a contin¬ 
uous state space. Scandinavian Journal ofStat., 1998. 

I Murray and R Salakhutdinov. Notes on the KL- 
divergence between a Markov chain and its equilibrium 
distribution. 2008. 

S Petrov. Coarse-to-fine natural language processing. 
2011 . 

JG Propp and DB Wilson. Exact sampling with coupled 
Markov chains and applications to statistical mechan¬ 
ics. Random structures and Algorithms, 9(1-2):223-252, 
1996. 

GO Roberts and RL Tweedie. Bounds on regeneration 
times and convergence rates for Markov chains. Stochas¬ 
tic Processes and their applications, 80(2):211-229, 
1999. 

JS Rosenthal. Minorization conditions and convergence 
rates for markov chain monte carlo. JASA, 1995. 

Evan Sandhaus. The new york times annotated corpus. Lin¬ 
guistic Data Consortium, Philadelphia, 6(12):e26752, 
2008. 

B Sapp, A Toshev, and B Taskar. Cascaded models for 
articulated pose estimation. In ECCV, 2010. 

R Sharma and A Aiken. From invariant checking to invari¬ 
ant inference using randomized search. In CAV, 2014. 

Libin Shen, Anoop Sarkar, and Franz Josef Och. Discrim¬ 
inative reranking for machine translation. In NAACL, 
pages 177-184, 2004. 

Tianlin Shi, Jacob Steinhardt, and Percy Liang. Learning 
where to sample in structured prediction. 2015. 

RS Sutton, D Mcallester, S Singh, and Y Mansour. Policy 
gradient methods for reinforcement learning with func¬ 
tion approximation. In NIPS, 2000. 

Paul Viola and Michael J Jones. Robust real-time face de¬ 
tection. International journal of computer vision, 57(2): 
137-154, 2004. 




Learning Fast-Mixing Models for Structured Prediction 


Martin J Wainwright. Estimating the wrong graphical 
model: Benefits in the computation-limited setting. The 
Journal of Machine Learning Research, 7:1829-1859, 
2006. 

David Weiss, Benjamin Sapp, and Ben Taskar. Sidestep¬ 
ping intractable inference with structured ensemble cas¬ 
cades. \nNIPS, volume 1281, pages 1282-1284, 2010. 

Payman Yadollahpour, Dhmv Batra, and Gregory 
Shakhnarovich. Discriminative re-ranking of diverse 
segmentations. In CVPR, pages 1923-1930. IEEE, 
2013. 




Learning Fast-Mixing Models for Structured Prediction 


A. Proofs 

Proof of Proposition 3.1. We will in fact show that, for all k > 1, Xk{A) = (1 — e)Afe(A), with the same eigenvector 

for both matrices. In other words, while the stationary distribution of A is different from A, all other eigenvectors are 

unchanged. 

Let Wk be the eigenvector of A corresponding to Afc. First note that Wk = 0. This is because 

1^ Awk = l^Wk ( 8 ) 

since A is stochastic, and 

Awk = Xkl^ Wk (9) 

since Wk is an eigenvector of A. Since Xk f 1, this implies that Wk = 0. 

Now we have 

Awk = (1 — e)Awk + eul^ Wk 
= (1 - e)XkWk, 

which proves that Afe(A) = (1 — e)Xk{A). In particular, A 2 (A) = (1 — e)A 2 (Al) < 1 — e. The mixing time of A is -— ^ , 

1 —A2(-A) 

and is therefore upper bounded by i, which completes the proof. □ 

Proof of Proposition 3.2. We can verify algebraically that AH = tt, as follows: 

Air = (1 — e)ATr + ertl^Tr 

= e(l — e)A{I — (1 — e)A)~^u + eu 
= e [(1 - e)A{I - (1 - e)A)-i +l]u 
= e [(1 - e)Al + (/-(!- e)A)] (/ - (1 - e)A)-^u 
= €{I-{1-€)A)-^U 
= 

SO that TT is indeed the stationary distribution of A. □ 


Proof of Proposition 3.5. From the characterization of tt in Proposition 3.2, we know that tt is equal to 

OO 

— ef A^ u. (10) 

3=0 

The rest of the proof consists of determining some useful properties of ci^(7r'). The most important (and the motivation for 
defining in the first place) is given in the following lemma, which we prove separately: 

Lemma A.l. IfTT is the stationary distribution of A and A satisfies detailed balance, then d,r(A7r') < A2(A)d^(7r'). 


The other important property of dj^iTr') is convexity: + (1 — w)tt”) < wd^ritr') + (1 — u>)d^(7r"), which follows 

directly from the characterization of d,r(7r') as a Mahalanobis distance. 


Putting these two properties together, we have 

OO 

d-TriTr) < £^^(1 ~ ef dTr{A^u) 


3=0 

OO 


< e^(l - ef X2iAy d.^{u) 

3=0 

e 


< 


< 


l-(l-e)A2(A) 

'd-K (^) ; 


^TT (^) 


1 — A 2 (^) 
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which completes the proof. 


□ 


Proof of Lemma A. 1. Recall we want to show that (i^(A7r') < A2(j4)(i^(7r'). To see this, first define S = 

diag(7r)“^/^yl diag(7r)^/^, which is symmetric by the detailed balance condition, and satisfies \k{S) = Afc(A) by similar¬ 
ity. Furthermore, the top eigenvector of S is diag(7r)^/^. Putting these together, we have 

fi^(A7r') = II diag(7r)“^/^(7r - A7r')||2 

= II diag(7r)"^/2A(7r - 7r')||2 

= ll'S'diag(7r)"^/^(7r - 7r')||2 
< A2(S')|1 diag(7r)"^/^(7r - 7r')||2 

= \2[S)d^W) 

= \2{A)d.^{'K'). 

The inequality step ||5'diag(7r)“^/^(7r — 7r')||2 < A2(<S')|1 diag(7r)“^/^(7r — 7r')||2 follows because diag(7r)“^/^(7r — tt') 
is orthogonal to the top eigenvector diag(7r)^/^ of S. □ 

Proof of Proposition 3.4. For any value of e, the stationary distribution of (1 — e)A + eul^ can be obtained by applying 
A a Geometric(e)-distributed number of times to u (by Proposition 3.2). For £2 < ei. it therefore suffices to construct 
a random variable F > 0 such that if s ^ F" and t ~ Geometric(ei) then s + f ^ Geometric(e 2 ); if we can do this, 
then we can let B be the matrix that applies A an F"-distributed number of times, and we would have Btt^^ = 
but B would clearly have stationary distribution tt, and so Lemma 3.3 would give KL (tt || Tfe^) < KL (tt || tTeJ and 
KL (tt^^ 11 tt) < KL (TTej^ II tt), which is the desired result. 

To construct the desired F, we use the fact that addition of random variables corresponds to convolution of the probability 
mass functions, and furthermore represent the probability mass functions as formal power series; in particular, we let 


/(a:) = ^ P[f = n I f ~ F]x", 


n—O 


and similarly 


geix) = ^ P[f = n I f ^ Geometric(e)]x” 

n—0 
00 

= ^e(l-e)V 


( 11 ) 


n=0 


1 — (1 — e)x' 

We want f{x)gei (x) to equal g^^ (x), so we take 

9e2 jx) 

9ei {x) 

£2 1 - (1 - ei)a; 

£1 1 - (1 - e2)x 

/ 00 

£2 


fix) = 


= - 1 + E [(1 - ^ 2 )” - (1 - e2)”-'(l - £l)] 


£1 


n—1 



From this we see that the random variable F with probability function 


P[f = n I f ~ F] = 


^ : n = 0 

f^(l-£ 2 )"-H£i-£ 2 ) : n>0 


( 12 ) 
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satisfies the required property that F + Geometric(ei) = Geometric( 52 ). Note that the condition €2 < ei is necessary 
here so that all of the probability masses are positive in the above expression. 

We can also prove the result purely algebraically. Motivated by the above construction, we define 

B = 62(1 — (1 — £2)^) ^ [ei(^ ~ (1 ~ £1)^) 

= — [-^ + (^1 ~ £2)^(1 — (1 — £2)^) ■ 

El 


By construction we have Btt^^ = but Taylor expanding the second expression for B shows that we can write it as 
a (infinite) convex combination of non-negative powers of A, and hence that B has stationary distribution tt. This again 
yields the desired result by Lemma 3.3. □ 


Proof of Theorem 3.6. We use an equivalent characterization of the strong Doeblin parameter as the quantity r(A) 
infy' A(jj I y'). In the context of the Markov chain M, this yields 


r(M^) = ^ inf P[z6,2/t, I zo,yo] 

— {zQ^yo) 


{zb,yb] 


= N X! I yo] X F[zb, yb\Ta = t] 

i^b.Vb) 


> V' V' inf P[ra = r I 2/0] X P[z&, yt, | = r] 

{zb,yb) 

b 

= 'V inf P[Ta = r I yo] V' ^Zb, Vb \ Ta = r] 

_ yo , . 

(2b,2/b) 

b 

= VinfP[Ta = r I yo] 
yo 


— • 


Finally, by Proposition 3.1, the spectral gap of is at least 70 b, hence the spectral gap of M is at least j^'ja.b, which 
proves the theorem. □ 


Proof of Corollary 3.7. Note that the time to transition from f to f -I- 1 is Geometric((5i)-distributed. Suppose we start 
from an arbitrary j G {0,..., A: — 1}. Then the time t' to get to fc — 1 is distributed as J2i=j Geometric(5i). t' has 
mean and variance < 7 ^—. In particular, with probability t' lies between 0 and 

. Now, consider the time f" to get from fc — 1 to 0; this is Geometric((5fe_i)-distributed, and conditioned on 
t” being at least t" + f — is also Geometric((5fe_i)-distributed. But t" is at least [ 5 ^ 7 ;] probability 

at least (1 — > A (since < ^). Hence independently of j, t" + t' is, with probability at least 

distributed according to [ 7 ^] + Geometric(5fe_i). But t” -f t' = r by construction, and so we need only compute 
the probability that r ; but this is just the probability that the geometric distribution is less than which is 

1 — (1 — > 1 — e“^. Therefore, expanding the definitions of to and t, we have 

that 7 [ 2 / 5 fe_ij Applying Theorem 3.6 then implies that the spectral gap of M is at least ^^ 7 ^, as was to be 

shown. □ 
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Proof of Lemma 4.2. The key idea is to use the identity ^ = f{x) 


dx 


in two places. We have 


. ^ g\9'^ogpe{z £ S) d , g. 

Pe{z e S) - — -= -^P 6 [z e 5 ) 


dpg{z) 

de 


dz 


= / Peiz) 
JS 


d log pg{z) 


dz 


= Peiz£S)E, 


de 

d\ogpe{z) 


de 


z £ S 


which completes the lemma. 


□ 


B. Correctness of Importance Sampling Algorithm 

In Section 4.1 of the main text, we had a distribution u over y and a Markov chain A{yt \ yt-i) on the same space. We 

then built a distribution over y* Ut>o{^} ^ sampling T ^ Geometric(e), yo ~ u, and yt \ yt-i A for 

y = 1,..., T (we use priyo-.T) to represent the distribution over yo-x given T). 

For a given y, we were interested in constructing an importance sampler for (T, yo-.x) \ Vt = y- The following Lemma 
establishes the correctness of the importance sampler that was presented. We assume we are interested in computing the 
expectation of some function g : y* —>■ IS. and show that the given importance weights correctly estimate E[y]. 

Lemma B.l. For a distribution F, suppose that we sample T ^ F and then sample yo-.T-i Pt-i- Tef Wt = 
p[T>t|r~F] I 2/t-i)- Consider the random variable 

T 

g ^wtg{t,yo:t-i,y)- (13) 


Then 


[5] ET~Geometric(£),i/0:T~PT [^(T', yO:T)I[yT y]] ■ 


(14) 


Proof. We have 


Exr^F.yo-.T-i^pig] = ET~F.yo:T-i 


yO:T-l~PT-l 


'^wtgit,yo:t-i,y) 

= ^P[r > f I T ^ F]Ey^.^_^^p^_^ [u!tg{t,yo,t-i,y)] 

t^O 

00 

= [My I yt-i)g{f yo:t-i,y)] 


00 


= 51 [git^ yo-.tMyt = y]] 




[y(T, yo:T)I[yT y]] 


as was to be shown. 


□ 














